cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests #13332

Merged: alexcrichton merged 2 commits into bytecodealliance:main from ggreif:gabor/clz-ctz-bool-fold on May 11, 2026

@ggreif
Contributor

@ggreif ggreif commented May 9, 2026

This PR adds four mid-end ISLE rules in `opts/icmp.isle` for the boolean-context cases, where the result of `ctz`/`clz` flows into a comparison against zero (the consumer only cares whether the count is zero, not about its numeric value):

```
ctz(X) == 0 iff (X & 1) != 0 ; LSB of X set
ctz(X) != 0 iff (X & 1) == 0 ; LSB of X clear
clz(X) == 0 iff X <signed 0 ; MSB of X set (X is negative)
clz(X) != 0 iff X >=signed 0 ; MSB of X clear (X is non-negative)
```

The bit-counting instruction is then dead and gets DCE'd. The backend emits a single-cycle `test reg, imm` (LSB case) or `test reg, reg; js` (sign case) instead of `TZCNT/BSF/LZCNT/BSR` + `TEST` + `JCC`. That saves roughly 3 cycles of latency per occurrence on Intel x86_64 (TZCNT/LZCNT have 3-cycle latency and a false GPR dependency), and more on JIT-less backends, whose bit-counting paths are typically loops.
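The four equivalences can be sanity-checked outside Cranelift. The following is a minimal Rust sketch, not the ISLE code from this PR: it models `ctz`/`clz` with `u32::trailing_zeros`/`u32::leading_zeros` (which, like the CLIF instructions, return the bit width for a zero input), and the function names `folds_hold` and `folds_hold_on_samples` are hypothetical.

```rust
/// Returns true if all four folds agree with the bit-counting form for `x`.
/// `trailing_zeros`/`leading_zeros` stand in for CLIF `ctz`/`clz`.
pub fn folds_hold(x: u32) -> bool {
    let ctz = x.trailing_zeros();
    let clz = x.leading_zeros();
    let signed = x as i32;

    (ctz == 0) == ((x & 1) != 0)        // ctz(X) == 0  iff  LSB set
        && (ctz != 0) == ((x & 1) == 0) // ctz(X) != 0  iff  LSB clear
        && (clz == 0) == (signed < 0)   // clz(X) == 0  iff  X <signed 0
        && (clz != 0) == (signed >= 0)  // clz(X) != 0  iff  X >=signed 0
}

/// Checks the folds on edge cases and on a coarse sweep of the u32 range.
pub fn folds_hold_on_samples() -> bool {
    let edges = [0u32, 1, 2, 3, 0x7FFF_FFFF, 0x8000_0000, 0x8000_0001, u32::MAX];
    edges.iter().all(|&x| folds_hold(x))
        && (0..=u32::MAX).step_by(65_537).all(folds_hold)
}
```

The zero input is the interesting edge case: both counts equal the bit width there, so `ctz(0) != 0` and `clz(0) != 0` line up with "LSB clear" and "non-negative".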

Why this matters in practice

The poster-child workload is the Motoko runtime's discriminator test on every `Nat`/`Int` operation:

  • Compact (scalar) integers: low bit clear — fast path is plain Wasm i32/i64 arithmetic.
  • Heap-allocated big integers (via libtommath): low bit set (skew tag).

Every arithmetic op begins with this LSB test. The Motoko codegen (`src/codegen/instrList.ml:97-100`) already emits the LSB-test-of-AND-1 pattern as `(ctz X)` — unconditionally, no flag gate — so every moc-compiled wasm running on wasmtime today does TZCNT + TEST + JCC on the hot path of every numeric op. The Rust RTS / GC paths that work on the same tagged pointer scheme see the same pattern.
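To illustrate why the `ctz` encoding is interchangeable with the plain LSB test, here is a hedged Rust sketch. The `Repr` enum and both function names are hypothetical and only mirror the tagging convention described above (low bit clear means compact, low bit set means heap-allocated); this is not Motoko's runtime or codegen. A wasm `if` on `ctz(X)` takes its then-arm when the count is nonzero, that is, when the low bit is clear, so the arms are swapped relative to an `if` on `X & 1`.

```rust
/// Hypothetical model of the two representations behind one tagged word.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Repr {
    Compact, // low bit clear
    Heap,    // low bit set (skew tag)
}

/// Direct form: `if (x & 1) { heap } else { compact }`.
pub fn classify_and1(x: u32) -> Repr {
    if x & 1 != 0 { Repr::Heap } else { Repr::Compact }
}

/// ctz form: `if (ctz x) { compact } else { heap }`, with the arms swapped.
pub fn classify_ctz(x: u32) -> Repr {
    if x.trailing_zeros() != 0 { Repr::Compact } else { Repr::Heap }
}
```

Both functions classify every word identically; the difference is only in how much work the condition costs once it reaches the backend.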

With these rules in place, cranelift collapses the comparison back to a single `test r, 1` — restoring the original cost of the discriminator and unlocking measurable speed-ups for every Motoko canister on a wasmtime-based IC subnet (and any other wasm that produces this shape).

The clz / sign-bit half exists for the same reason, for the rarer paths that test the sign before dispatching. It is a structurally parallel rewrite, so it ships in the same patch.

The converse fold on the wasm-byte-savings side is in WebAssembly/binaryen#8562 (LSB→ctz under `-Os`); landing it there together with this in cranelift gives byte savings without cycle cost.

Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (`ctz(X) == 4` must not trigger — that's a numeric-value test on the count, a different rewrite family).
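The negative case is worth spelling out, because a comparison against a nonzero count depends on several bits rather than one. As a small Rust sketch (function names are hypothetical, and the mask form is shown only to make the point, not as a rewrite this PR performs), `ctz(X) == 4` is equivalent to checking the low five bits at once:

```rust
/// The bit-counting form of the negative test.
pub fn ctz_is_4(x: u32) -> bool {
    x.trailing_zeros() == 4
}

/// Equivalent multi-bit form: bits 0..=3 clear and bit 4 set.
pub fn low5_is_10000(x: u32) -> bool {
    (x & 0b1_1111) == 0b1_0000
}

/// A single-bit LSB test cannot decide `ctz(x) == 4`: it only says "count is nonzero".
pub fn lsb_clear(x: u32) -> bool {
    (x & 1) == 0
}
```

For example, `2` and `16` both have a clear low bit, yet only `16` has exactly four trailing zeros, so applying the LSB fold here would be unsound.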

@ggreif ggreif requested a review from a team as a code owner May 9, 2026 15:04
@ggreif ggreif requested review from cfallin and removed request for a team May 9, 2026 15:04
@ggreif ggreif marked this pull request as draft May 9, 2026 15:08
…gn-bit tests

When the result of a count-trailing/leading-zeros instruction is fed
into a comparison against zero (the only thing the consumer cares
about is whether the count is zero, not its numeric value), rewrite
to test the corresponding bit of X directly:

  ctz(X) == 0   iff  LSB of X is set     iff  (X & 1) != 0
  clz(X) == 0   iff  MSB of X is set     iff  X is signed-negative

The bit-counting instruction can then be DCE'd. Backend emits a
single-cycle `test reg, imm` (LSB case) or `test reg, reg; js`
(sign case) instead of TZCNT/BSF/LZCNT/BSR + TEST + JCC — saves
~3 cycles of latency on Intel x86_64 per occurrence and removes
the false GPR dependency. JIT-less backends benefit even more:
their bit-counting paths are typically loops.

Motivated by the converse wasm-side peephole in
WebAssembly/binaryen#8562 (LSB→ctz fold under -Os for byte savings).
With these mid-end rules in place, that fold is cycle-neutral on
cranelift JITs even when fed unconditionally.

Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a
negative case (ctz(X) == 4 must NOT trigger — that's a numeric-value
test on the count, a different rewrite family).
@ggreif ggreif force-pushed the gabor/clz-ctz-bool-fold branch from 1734a52 to 30531ed Compare May 9, 2026 15:14
@ggreif ggreif changed the title cranelift: fold ctz/clz comparisons against saturation values into direct LSB / null tests cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests May 9, 2026
@ggreif
Contributor Author

ggreif commented May 9, 2026

Sketched an extension to also catch the wasm-emitted shape `brif (ireduce.i32 (ctz.i64 X))` (and the `clz` equivalent), which is what frontends like Motoko's `moc` produce. Wasm's `if` takes an i32 condition, so the i64 LSB test always flows through `ireduce` and then directly into `brif`, with no `icmp` interposed for the egraph rules in this PR to match.

Scope creep: the natural place is each backend's `is_nonzero` helper (x64 `inst.isle:3806-3826`, aarch64 `inst.isle:4659-4670`, plus riscv64 and s390x), where rules like

(rule (is_nonzero (ctz val @ (value_type (ty_32_or_64 ty))))
  (CondResult.CC (x64_test ty val (RegMemImm.Imm 1)) (CC.Z)))

(rule (is_nonzero (clz val @ (value_type (ty_32_or_64 ty))))
  (CondResult.CC (x64_test ty val val) (CC.NS)))

would lower `brif (ctz X)` to `test X, 1; jz` and `brif (clz X)` to `test X, X; jns` directly, plus an `(ireduce _ (ctz/clz _))` variant for the wasm path.

That's 4× backend files + filetests, different reviewers per arch, and a different review audience from this egraph PR. Punting on the amendment and filing a separate follow-up instead.
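The reason an `ireduce` variant would be sound is that the count of a 64-bit value is at most 64, so truncating it to 32 bits never changes whether it is zero. A minimal Rust model of that argument follows; the function names are hypothetical, and this models the semantics only, not any backend's lowering code.

```rust
/// Models `ireduce.i32 (ctz.i64 x)`: the 64-bit count truncated to 32 bits.
pub fn reduced_ctz(x: u64) -> u32 {
    let count: u64 = u64::from(x.trailing_zeros()); // ctz.i64 result, in 0..=64
    count as u32 // ireduce.i32 keeps the low 32 bits; nothing is lost for 0..=64
}

/// What `brif` observes: the branch is taken when the condition is nonzero.
pub fn brif_taken_ctz(x: u64) -> bool {
    reduced_ctz(x) != 0
}

/// Same for `brif (ireduce.i32 (clz.i64 x))`.
pub fn brif_taken_clz(x: u64) -> bool {
    (u64::from(x.leading_zeros()) as u32) != 0
}

/// Direct tests that the sketched rules would emit instead.
pub fn lsb_clear(x: u64) -> bool {
    (x & 1) == 0 // `test x, 1` with condition Z
}

pub fn sign_clear(x: u64) -> bool {
    (x as i64) >= 0 // `test x, x` with condition NS
}
```

Under this model, `brif_taken_ctz` agrees with `lsb_clear` and `brif_taken_clz` agrees with `sign_clear` for every input, including zero.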

@ggreif ggreif marked this pull request as ready for review May 9, 2026 16:48
@ggreif
Contributor Author

ggreif commented May 9, 2026

Concrete real-world workload for the clz boolean-context fold: Motoko's classical-persistence backend already emits the bare `i32.clz; if` shape for Int <-> 0 sign tests (e.g. `Prim.abs`). Two compiler-side peepholes in `motoko/src/codegen/instrList.ml` (`shrU 31; if` -> `clz; if(swap)`, and `and 0x80000000; if` -> `clz; if(swap)`) generate this directly because their target is i32 wasm without a wrap.

So the JIT-side fold here is the natural meeting point for classical-Motoko output: clz directly into brif, no icmp, no ireduce. The rules in this PR don't yet catch that shape (it's brif (clz X), not icmp eq (clz X) 0), but the equivalent backend lowering (per the previous comment) would close that gap end-to-end.
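The two peepholes mentioned above rest on the same sign-bit identity, seen from the producer side. The sketch below is a hedged Rust model with hypothetical function names, not the Motoko implementation: each function is the condition a wasm `if` would see, and the `clz` condition is the logical negation of the other two, which is why the arms must be swapped.

```rust
/// Condition of `x shrU 31; if`: nonzero exactly when the sign bit is set.
pub fn cond_shr31(x: u32) -> bool {
    (x >> 31) != 0
}

/// Condition of `x and 0x80000000; if`: same predicate, different encoding.
pub fn cond_and_msb(x: u32) -> bool {
    (x & 0x8000_0000) != 0
}

/// Condition of `clz x; if`: nonzero exactly when the sign bit is clear.
pub fn cond_clz(x: u32) -> bool {
    x.leading_zeros() != 0
}

/// Models `if(swap)`: evaluates the `clz` condition and picks the opposite arm.
pub fn if_swapped<T>(x: u32, then_arm: T, else_arm: T) -> T {
    if cond_clz(x) { else_arm } else { then_arm }
}
```

With this model, `if_swapped(x, a, b)` returns the same arm as `if cond_shr31(x) { a } else { b }` for every `x`.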

@ggreif ggreif changed the title cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests May 9, 2026
@github-actions github-actions Bot added the cranelift (Issues related to the Cranelift code generator) and isle (Related to the ISLE domain-specific language) labels May 9, 2026
@github-actions

github-actions Bot commented May 9, 2026

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

@alexcrichton
Member

Thanks for this! Are the backend rules necessary with the mid-end egraph rules? I'd expect that the egraph rewrites would be sufficient and the backends largely wouldn't need to change, unless they need to emit a new instruction pattern which isn't currently recognized.

One thing you may also want to do is to add something in tests/disas/*.wat. That's a wasm-level test which ensures that the right assembly will be generated which is a bit of an end-to-end test. Not required, but if you're curious that'd be a good way to validate that this is all kicking in correctly.

Exercises three consumers (if, select, eqz) over the icmp-mediated
shapes the egraph rewrites in `cranelift/codegen/src/opts/icmp.isle`
target: `(ctz X) == 0`, `(ctz X) != 0`, and the analogous clz forms,
across i32/i64.

The blessed disassembly shows:

- icmp-mediated cases collapse to a single bit test
  (`testl $1, %edx; jne` for ctz, `testl %edx, %edx; jl` for clz).
- a bare `if (ctz X)` / `if (clz X)` form (no icmp interposed,
  i.e. the wasm-natural shape produced by frontends like Motoko's
  `moc`) compiles to full bsf+cmov+test or bsr+cmov+sub+test, since
  the brif's implicit zero-test is not visible to the value-level
  egraph rules.
- `(ctz X) == 4` (numeric, not boolean) correctly stays as
  bsf+cmp+je — the rules don't over-fire.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggreif ggreif requested a review from a team as a code owner May 10, 2026 09:14
@ggreif
Contributor Author

ggreif commented May 10, 2026

Added tests/disas/ctz-clz-bool-condition.wat (commit 0519796) per your suggestion — covers if/select/eqz consumers over (ctz/clz X) eq/ne 0 for i32 and i64, plus a numeric-comparison negative test ((ctz X) == 4).

Empirical answer to your first question — the egraph rules in this PR are complete for the icmp-mediated case but don't catch the wasm-natural bare form:

icmp-mediated (this PR's target — collapses correctly):

;; if_ctz_eq0_i32:        testl $1, %edx; jne
;; select_ctz_eq0_i32:    testl $1, %edx; cmovne
;; eqz_ctz_eq0_i32:       testl $1, %edx; sete
;; if_clz_eq0_i32:        testl %edx, %edx; jl

bare if (ctz X) (no icmp — what wasm frontends like Motoko emit directly):

;; if_ctz_bare_i32:       bsfl %edx, %r9d; cmovel; testl %r9d, %r9d; jne
;; if_clz_bare_i32:       bsrl + cmovel + 0x1f-sub + test  ;; 5 insns

negative test (correctly not collapsed — confirms the rules don't over-fire on numeric comparisons):

;; if_ctz_eq4_i32:        bsfl %edx, %r9d; cmpl $4, %r9d; je

So the egraph rules cover their intended shape end-to-end. The bare form would need backend is_nonzero specializations as a follow-up — happy to land this PR with just the egraph rules and tackle the bare form separately, if that scoping makes sense to you.
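For readers less familiar with the x86 flag conventions behind the collapsed sequences, here is a small Rust model. It is a sketch under stated assumptions: the `Flags` struct and function names are hypothetical, and only the flags relevant to `jne` and `jl` are modeled. `test a, b` computes `a & b`, sets ZF and SF from the result, and clears OF; `jne` branches when ZF is clear, and `jl` branches when SF differs from OF.

```rust
/// Subset of the flags written by x86 `test`.
pub struct Flags {
    pub zf: bool, // result was zero
    pub sf: bool, // result had its top bit set
    pub of: bool, // always cleared by `test`
}

/// Models `testl b, a` on 32-bit operands.
pub fn test32(a: u32, b: u32) -> Flags {
    let r = a & b;
    Flags { zf: r == 0, sf: (r >> 31) != 0, of: false }
}

/// `jne`: taken when ZF is clear.
pub fn jne(f: &Flags) -> bool {
    !f.zf
}

/// `jl`: taken when SF != OF; after `test`, that is simply SF.
pub fn jl(f: &Flags) -> bool {
    f.sf != f.of
}
```

In this model, `jne(&test32(x, 1))` is true exactly when `ctz(x) == 0`, and `jl(&test32(x, x))` is true exactly when `clz(x) == 0`, matching the `if_ctz_eq0_i32` and `if_clz_eq0_i32` lines above.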

Member

@alexcrichton alexcrichton left a comment

Ah yes good point, and yeah sounds good to me. Thanks for your work here! Happy to review backend-specific changes as well

@alexcrichton alexcrichton added this pull request to the merge queue May 11, 2026
Merged via the queue into bytecodealliance:main with commit 83ee70b May 11, 2026
48 checks passed
Labels

cranelift (Issues related to the Cranelift code generator), isle (Related to the ISLE domain-specific language)

2 participants